Information retrieval for OCR documents: a content-based probabilistic correction model
Abstract
Information retrieval for OCR documents is difficult because OCR output contains a significant number of erroneous words, while most information retrieval techniques rely heavily on word matching between documents and queries. In this paper, we propose a general content-based correction model that can work on top of an existing OCR correction tool to “boost” retrieval performance. The basic idea is to exploit the whole content of a document to supplement the information that the existing OCR correction tool provides for word correction. Instead of making an explicit correction decision for each erroneous word, as a traditional approach typically does, we account for the uncertainty in such decisions and compute an estimate of the original “uncorrupted” document language model accordingly. This document language model can then be used for retrieval with a language modeling retrieval approach. Evaluation on standard TREC test collections indicates that our method significantly improves retrieval performance compared with simple word correction approaches such as using only the top-ranked correction.
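The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea, not the authors' actual method. It assumes each OCR token arrives with a set of candidate corrections and tool-assigned probabilities (a hypothetical input format), re-weights those candidates by the current document model in a few EM-style passes, and then scores a query with Jelinek-Mercer smoothing. The function names, the smoothing choice, and the parameter `lam` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def _expected_counts(tokens, model):
    """Accumulate soft counts over candidate corrections for every token."""
    counts = defaultdict(float)
    for candidates in tokens:
        # Weight each candidate by the tool's probability and, when a
        # document model is available, by that model's support for the word.
        weights = {
            w: p * (model.get(w, 0.0) if model is not None else 1.0)
            for w, p in candidates.items()
        }
        total = sum(weights.values())
        if total == 0:
            continue
        for w, weight in weights.items():
            counts[w] += weight / total
    return counts


def _normalize(counts):
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total > 0 else {}


def estimate_document_model(tokens, iterations=10):
    """tokens: list of dicts mapping candidate word -> tool probability.

    Returns an estimated unigram model of the uncorrupted document without
    committing to a single correction per token (assumed formulation).
    """
    model = _normalize(_expected_counts(tokens, None))
    for _ in range(iterations):
        model = _normalize(_expected_counts(tokens, model))
    return model


def query_log_likelihood(query, doc_model, collection_model, lam=0.5):
    """Query likelihood with Jelinek-Mercer smoothing (an assumed choice)."""
    score = 0.0
    for w in query:
        p = (1 - lam) * doc_model.get(w, 0.0) + lam * collection_model.get(w, 1e-9)
        score += math.log(p)
    return score


# Hypothetical OCR output: the first token is ambiguous between two readings.
tokens = [
    {"retrieval": 0.6, "retrieva1": 0.4},
    {"retrieval": 1.0},
    {"model": 0.5, "mode1": 0.5},
]
doc_model = estimate_document_model(tokens)
collection_model = {"retrieval": 0.01, "other": 0.01}
score = query_log_likelihood(["retrieval"], doc_model, collection_model)
```

In this toy input, the second token supports the reading "retrieval", so the soft counts for the first token shift toward it. The third token has no supporting context and stays evenly split, which is the uncertainty a top-1 correction would discard.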
Similar resources
A Content-based Probabilistic Correction Model for OCR Document Retrieval
Information retrieval for OCR documents is difficult because OCR output contains a significant number of erroneous words, while most information retrieval techniques rely heavily on word matching between documents and queries. In this paper, we propose a general content-based correction model that can work on top of an existing OCR correction tool to “boost...
Retrieving Arabic Printed Document: a Survey
This paper surveys some of the literature pertaining to searching and retrieving OCR’ed printed documents with emphasis on Arabic documents. It examines peculiarities of Arabic morphology, orthography, retrieval, word clustering, display, OCR, and error correction. The paper surveys existing evaluation test-beds for retrieval of Arabic OCR texts. Lastly, it concludes with possible directions fo...
Probabilistic Logical Information Retrieval for Content, Hypertext, and Database Querying
Classical retrieval models support content-oriented searching for documents using a set of words as data model. However, in hypertext and database applications we want to consider the link structure and attribute values of documents in addition to the pure content. In this paper, we present a framework based on probabilistic logical retrieval for describing the retrieval function for a query wh...
Retrieving Images of Scanned Text Documents
Information retrieval is the task of finding documents, usually text, which are relevant to a user's information need. A conventional approach to information management of paper documents is normally based on classifying them into a hierarchical classification structure. More recently we have seen electronic document management systems which manage scanned images of documents in the same way as pa...
A Survey on Various Word Spotting Techniques for Content Based Document Image Retrieval
Searching documents for information and retrieving relevant documents is a basic activity. Various tools are readily available for searching and retrieval from digital documents, but few robust methods are available for retrieval from historic documents and old manuscripts, as they are not digitized but available in scanned formats. Conventional way of retrieval from scanned document imag...